skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Chiu, Sin-Wei"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Quantum computers pose a significant threat to modern cryptographic systems by efficiently solving problems such as integer factorization through Shor’s algorithm. Homomorphic encryption (HE) schemes based on ring learning with errors (Ring-LWE) offer a quantum-resistant framework for secure computations on encrypted data. Many of these schemes rely on polynomial multiplication, which can be efficiently accelerated using the number theoretic transform (NTT) in leveled HE, ensuring practical performance for privacy-preserving applications. This article presents a novel NTT-based serial pipelined multiplier that achieves full-hardware utilization through interleaved folding, and overcomes the 50% under-utilization limitation of the conventional serial R2MDC architecture. In addition, it explores tradeoffs in pipelined parallel designs, including serial, 2-parallel, and 4-parallel architectures. Our designs leverage increased parallelism, efficient folding techniques, and optimizations for a selected constant modulus to achieve superior throughput (TP) compared with state-of-the-art implementations. While the serial fold design minimizes area consumption, the 4-parallel design maximizes TP. Experimental results on the Virtex-7 platform demonstrate that our architectures achieve at least 2.22 times higher TP/area for a polynomial length of 1024 and 1.84 times for a polynomial length of 4096 in the serial fold design, while the 4-parallel design achieves at least 2.78 times and 2.79 times, respectively. The efficiency gain is even more pronounced in TP squared over area, where the serial fold and 4-parallel designs outperform prior works by at least 4.98 times and 26.43 times for a polynomial length of 1024 and 6.7 times and 43.77 times for a polynomial length of 4096, respectively. These results highlight the effectiveness of our architectures in balancing performance, area efficiency, and flexibility, making them well-suited for high-speed cryptographic applications. 
    more » « less
    Free, publicly-accessible full text available June 11, 2026
  2. Convolution is a fundamental operation with diverse applications in signal processing, computer vision, and machine learning. This article reviews three distinct convolutions: linear convolution (also referred to as aperiodic convolution), positive-wrapped convolution (PWC) (also known as circular convolution), and negative-wrapped convolution (NWC). Additionally, we propose an alternative approach to computing linear convolution without zero padding by leveraging the PWC and NWC. We compare two fast Fourier transform (FFT)-based methods to compute linear convolution: the traditional zero-padded PWC method and a new method based on the PWC and NWC. Through a detailed analysis of the flowgraphs (FGs), we demonstrate the equivalence of these methods while highlighting their unique characteristics. We show that computing the NWC using the weighted PWC method is equivalent to a part of the linear convolution computation with zero padding. Furthermore, it is possible to extract the PWC and NWC from structures to compute linear convolution with zero padding, where the last butterfly stage can be eliminated. This article aims to establish a clear connection among PWC, NWC, and linear convolution, illustrating new perspectives on computing different convolutions. 
    more » « less
    Free, publicly-accessible full text available May 1, 2026
  3. Polynomial multiplication over the quotient ring is a critical operation in Ring Learning with Errors (Ring-LWE) based cryptosystems that are used for post-quantum cryptography and homomorphic encryption. This operation can be efficiently implemented using number-theoretic transform (NTT)-based architectures. Among these, pipelined parallel NTTbased polynomial multipliers are attractive for cloud computing as these are well suited for high throughput and low latency applications. For a given polynomial length, a pipelined parallel NTT-based multiplier can be designed with varying degrees of parallelism, resulting in different tradeoffs. Higher parallelism reduces latency but increases area and power consumption,and vice versa. In this paper, we develop a predictive model based on synthesized results for pipelined parallel NTT-based polynomial multipliers and analyze design tradeoffs in terms of area, power, energy, area-time product, and area-energy product across parallelism levels up to 128. We predict that, for very long polynomials, area and power differences between designs with varying levels of parallelism become negligible. In contrast, areatime product and energy per polynomial multiplication decrease with increased parallelism. Our findings suggest that, given area and power constraints, the highest feasible level of parallelism optimizes latency, area-time product, and energy per polynomial multiplication. 
    more » « less
  4. This tutorial aims to establish connections between polynomial modular multiplication over a ring to circular convolution and the discrete Fourier transform (DFT). The main goal is to extend the well-known theory of the DFT in signal processing (SP) to other applications involving polynomials in a ring, such as homomorphic encryption (HE). 
    more » « less
  5. Data privacy has become a significant concern due to the rapid development of cloud services, Internet of Things, edge devices, and other applications. Homomorphic encryption (HE) addresses the issue by enabling computations to be performed without the decryption of the encrypted message. However, the bottleneck of designing homomorphic encryption hardware is the complexity of computation. To tackle the long integer arithmetic, the residue number system based on the Chinese remainder theorem is used. In this paper, we propose a novel modular reduction architecture that computes the mapping of residual polynomials in parallel with high speed and low latency. We implement our proposed design in the Xilinx Ultrascale+ FPGA board (VCU118). When the input sizes are 360-bit (1440-bit), the frequency is 180MHz (168MHz) with 4 pipelining stages. Also, the area delay product (ADP) of DSP blocks of our design is reduced by 23 and 31 percent, respectively, for 360 and 1440 bits, compared to prior work. 
    more » « less
  6. High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). Parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles to process the polynomials. Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes three novel contributions. First, cascaded parallel NTT and iNTT architectures are proposed such that any buffer requirement for permuting the product of the NTTs before it is input to the iNTT is eliminated. This is achieved by using different folding sets for the NTTs and iNTT. Second, a novel approach to expand the set of feasible special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Third, novel architectures for pre-processing for computing residual polynomials using the CRT and post-processing for combining the residual polynomials are proposed. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate such as in the cloud, as these feed-forward architectures can be pipelined at arbitrary levels. Pipelining and latency tradeoffs are also investigated. Compared to a prior design, the proposed architecture reduces latency by a factor of 49.2, and the area-time products (ATP) for the lookup table and DSP, ATP(LUT) and ATP(DSP), respectively, by 89.2% and 92.5%. Specifically, we show that for n =4096 and a 180-bit coefficient, the proposed 2-parallel architecture requires 6.3 Watts of power while operating at 240 MHz, with 6 moduli, each of length 30 bits, using Xilinx Virtex Ultrascale+ FPGA. 
    more » « less